
Data

Data are available on the Kaggle website: https://www.kaggle.com/code/maryamsayagh1/cardiovascular-disease-prediction/data

library(knitr)

knitr::opts_chunk$set(message = FALSE, warning = FALSE, fig.width = 9, fig.height = 9)

library(devtools)
## Loading required package: usethis
library(rmarkdown)
library(tidyverse)
## ── Attaching packages ─────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0     ✔ purrr   1.0.1
## ✔ tibble  3.1.8     ✔ dplyr   1.1.0
## ✔ tidyr   1.3.0     ✔ stringr 1.5.0
## ✔ readr   2.1.3     ✔ forcats 1.0.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(ggplot2)
library(FactoMineR)
library(InformationValue)
library(ISLR)
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following objects are masked from 'package:InformationValue':
## 
##     confusionMatrix, precision, recall, sensitivity, specificity
## 
## The following object is masked from 'package:purrr':
## 
##     lift
# Read the dataset
df <- read_delim("~/Documents/Info_Stats/R/Kaggle/Cardiovascular/cardio_train.csv", delim = ";")
## Rows: 70000 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ";"
## dbl (13): id, age, gender, height, weight, ap_hi, ap_lo, cholesterol, gluc, ...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# df <- read_delim("C:/Documents and Settings/steph/Documents/CardiovascularDisease_LogisticRegression/cardio_train.csv", delim = ";")
df <- as_tibble(df)
df <- df %>% rename_with(toupper)
df
## # A tibble: 70,000 × 13
##       ID   AGE GENDER HEIGHT WEIGHT AP_HI AP_LO CHOLE…¹  GLUC SMOKE  ALCO ACTIVE
##    <dbl> <dbl>  <dbl>  <dbl>  <dbl> <dbl> <dbl>   <dbl> <dbl> <dbl> <dbl>  <dbl>
##  1     0 18393      2    168     62   110    80       1     1     0     0      1
##  2     1 20228      1    156     85   140    90       3     1     0     0      1
##  3     2 18857      1    165     64   130    70       3     1     0     0      0
##  4     3 17623      2    169     82   150   100       1     1     0     0      1
##  5     4 17474      1    156     56   100    60       1     1     0     0      0
##  6     8 21914      1    151     67   120    80       2     2     0     0      0
##  7     9 22113      1    157     93   130    80       3     1     0     0      1
##  8    12 22584      2    178     95   130    90       3     3     0     0      1
##  9    13 17668      1    158     71   110    70       1     1     0     0      1
## 10    14 19834      1    164     68   110    60       1     1     0     0      0
## # … with 69,990 more rows, 1 more variable: CARDIO <dbl>, and abbreviated
## #   variable name ¹​CHOLESTEROL
dim(df)
## [1] 70000    13

This work is based on a set of R functions built specifically to fit and assess a logistic regression model. The response variable is CARDIO, which is binary. The predictors are quantitative and categorical.

# Source the R add-ons that I coded for logistic regression in R
# devtools::source_url("https://github.com/stephaneLabs/Logistic_Addons/logisticRegressionAddons_20230209.R")
source("~/Documents/Info_Stats/R/Kaggle/Cardiovascular/logisticRegressionAddons_20230314.R")
## 
## Attaching package: 'DescTools'
## The following objects are masked from 'package:caret':
## 
##     MAE, RMSE
## ResourceSelection 0.3-5   2019-07-22
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
## 
## Attaching package: 'arules'
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following objects are masked from 'package:base':
## 
##     abbreviate, write
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:arules':
## 
##     recode
## The following object is masked from 'package:DescTools':
## 
##     Recode
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following object is masked from 'package:purrr':
## 
##     some
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
#source("C:/Documents and Settings/steph/Documents/CardiovascularDisease_LogisticRegression/logisticRegressionAddons_20230314.R")
# Quick format of the data
df <- df %>% mutate(ID     = ID  %>% as.numeric(),
                    AGE    = round(as.numeric(AGE) / 365, 1),
                    HEIGHT = HEIGHT %>% as.numeric(),
                    WEIGHT = WEIGHT %>% as.numeric(),
                    BMI    = WEIGHT / (HEIGHT / 100)^2,
                    AP_HI  = as.numeric(AP_HI) / 10,
                    AP_LO  = as.numeric(AP_LO) / 10,
                    GENDER      = GENDER      %>% as.factor(),
                    CHOLESTEROL = CHOLESTEROL %>% as.factor(),
                    GLUC        = GLUC        %>% as.factor(),
                    SMOKE       = SMOKE       %>% as.factor(),
                    ALCO        = ALCO        %>% as.factor(),
                    ACTIVE      = ACTIVE      %>% as.factor(),
                    CARDIO      = CARDIO      %>% as.factor()
                    )

nrow_init <- nrow(df)

We compute the BMI, convert the age to approximate years, and divide the systolic and diastolic blood pressures by 10 to bring them into the usual range seen when blood pressure is taken at the physician's office.
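As a quick sanity check of the BMI formula on the first record shown above (height 168 cm, weight 62 kg), a short hand computation:

```r
# BMI = weight [kg] / (height [m])^2
weight <- 62           # kg, first record of the tibble above
height <- 168          # cm
bmi <- weight / (height / 100)^2
round(bmi, 2)          # 21.97
```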

Data exploration: descriptive statistics of the continuous predictors

Observations are considered outliers when they fall outside the interval ]Q1 - 1.5 x IQR; Q3 + 1.5 x IQR[, where Q1 is the first quartile, Q3 the third quartile, and IQR the interquartile range.
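This rule can be sketched in a few lines of base R; the actual filtering is done by the removeOutliers() helper sourced from the add-ons file, so the following is only an illustration:

```r
# Keep observations inside ]Q1 - 1.5*IQR; Q3 + 1.5*IQR[
iqr_keep <- function(x) {
  q   <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[2] - q[1]
  x > q[1] - 1.5 * iqr & x < q[2] + 1.5 * iqr
}
x <- c(1:10, 100)      # 100 is an obvious outlier
x[iqr_keep(x)]         # [1]  1  2  3  4  5  6  7  8  9 10
```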

AGE

quantitativeVariableDescription(df = df, variableName = "AGE")
## Length of the variable :
## 70000
## Variable summary :
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   29.60   48.40   54.00   53.34   58.40   65.00
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

df <- removeOutliers(df, variable = "AGE")$without_df
## Number of rows of initial dataframe : 70000
## Number of rows of cleaned dataframe : 69996
## Proportion of rows kept : 1

HEIGHT

quantitativeVariableDescription(df = df, variableName = "HEIGHT")
## Length of the variable :
## 69996
## Variable summary :
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    55.0   159.0   165.0   164.4   170.0   250.0
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

WEIGHT

quantitativeVariableDescription(df = df, variableName = "WEIGHT")
## Length of the variable :
## 69996
## Variable summary :
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.00   65.00   72.00   74.21   82.00  200.00
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

BMI

quantitativeVariableDescription(df = df, variableName = "BMI")
## Length of the variable :
## 69996
## Variable summary :
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.472  23.875  26.378  27.557  30.222 298.667
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

df <- removeOutliers(df, variable = "BMI")$without_df
## Number of rows of initial dataframe : 69996
## Number of rows of cleaned dataframe : 68001
## Proportion of rows kept : 0.97
df <- removeOutliers(df, variable = "AP_HI")$without_df
## Number of rows of initial dataframe : 68001
## Number of rows of cleaned dataframe : 65059
## Proportion of rows kept : 0.96
df <- removeOutliers(df, variable = "AP_LO")$without_df
## Number of rows of initial dataframe : 65059
## Number of rows of cleaned dataframe : 62011
## Proportion of rows kept : 0.95

Removing observations where the systolic blood pressure AP_HI is lower than the diastolic blood pressure AP_LO

# systolic blood pressure AP_HI must be greater than diastolic blood pressure AP_LO
df <- df %>% filter(AP_HI > AP_LO)
nrow_clean <- nrow(df)
pct_clean  <- ((nrow_init - nrow_clean) / nrow_init) * 100

cat("Cleaning removed ", round(pct_clean, 2), "% of the rows of the initial dataframe\n", sep = "")
## Cleaning removed 11.42% of the rows of the initial dataframe

Pairs plot of the predictors

# Strange patterns suggest that 0 codes for NA in some variables
df_quanti <- select(df, c(ID, AGE, HEIGHT, WEIGHT, BMI, AP_HI, AP_LO, CARDIO)) %>% column_to_rownames(var = "ID")

# Upper-panel helper for pairs(): displays |correlation|, sized by its magnitude
panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, ...)
{
    par(usr = c(0, 1, 0, 1))
    r <- abs(cor(x, y))
    txt <- format(c(r, 0.123456789), digits = digits)[1]
    txt <- paste0(prefix, txt)
    if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
    text(0.5, 0.5, txt, cex = cex.cor * r)
}

# pairs(df_quanti,  upper.panel = panel.cor, col = c("black", "red")[as.numeric(df_quanti$CARDIO)])

Re-plotting the modified variables

AGE cleaned

quantitativeVariableDescription(df = df, variableName = "AGE")
## Length of the variable :
## 62009
## Variable summary :
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   39.10   48.50   54.00   53.37   58.40   65.00
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

BMI Cleaned

quantitativeVariableDescription(df = df, variableName = "BMI")
## Length of the variable :
## 62009
## Variable summary :
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   14.48   23.88   26.22   27.02   29.75   39.74
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

AP_HI Cleaned

quantitativeVariableDescription(df = df, variableName = "AP_HI")
## Length of the variable :
## 62009
## Variable summary :
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.50   12.00   12.00   12.62   14.00   16.90
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

AP_LO Cleaned

quantitativeVariableDescription(df = df, variableName = "AP_LO")
## Length of the variable :
## 62009
## Variable summary :
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   6.600   8.000   8.000   8.164   9.000  10.400
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

PCA of the quantitative predictors

# PCA of the data; missing values are replaced by the variable mean
# Some variables are highly correlated (collinearity problems?)
df_PCA <- PCA(X = df_quanti, scale.unit = TRUE, quali.sup = 7, ncp = 6, graph = FALSE)

# The first 4 components explain 95% of the variance
barplot(df_PCA$eig[,2])

kable(df_PCA$eig)
         eigenvalue  percentage of variance  cumulative percentage of variance
comp 1    2.2698131              37.8302176                           37.83022
comp 2    1.4042520              23.4042000                           61.23442
comp 3    1.1395236              18.9920605                           80.22648
comp 4    0.8895977              14.8266285                           95.05311
comp 5    0.2930822               4.8847028                           99.93781
comp 6    0.0037314               0.0621907                          100.00000
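Since the PCA runs on 6 standardised quantitative variables (scale.unit = TRUE), the total variance is 6 and each percentage in the table is simply eigenvalue / 6 x 100:

```r
# Percentage of variance of component 1, from its eigenvalue
eig1 <- 2.2698131
round(eig1 / 6 * 100, 4)   # 37.8302
```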
# Discrimination of the two diagnoses is not very good in the first PCA plane
plot(df_PCA,axes = c(1,2), choix = "var")

plot(df_PCA,axes = c(1,2), choix = "ind", habillage = 7)

plot(df_PCA,axes = c(1,3), choix = "var")

plot(df_PCA,axes = c(1,3), choix = "ind", habillage = 7)

plot(df_PCA,axes = c(2,3), choix = "var")

plot(df_PCA,axes = c(2,3), choix = "ind", habillage = 7)

Checking the linearity of the relationship between the log odds and the quantitative predictors

We check here the relationship between the log odds of cardiovascular disease and the predictors. A linear relationship must be observed to use a predictor as quantitative. If the relationship is not linear, a transformation, or a conversion into a factor, is needed.

To obtain the values for our plot, groups are formed with a discretization function and the group means are plotted to see if we get a linear shape. The discretization method is the "cluster" option of the "discretize" function of the "arules" package, which is based on k-means clustering.
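A self-contained sketch of this check on synthetic data, using stats::kmeans directly to mimic the cluster-based binning (the real analysis uses logOddsVsQuantitativePredictor() from the add-ons file), with a +0.5 continuity correction to keep the empirical log odds finite in small bins:

```r
set.seed(42)
x <- rnorm(5000)
y <- rbinom(length(x), 1, plogis(0.8 * x))  # true logit is linear in x
bins <- kmeans(x, centers = 10)$cluster     # k-means binning of x
xm   <- tapply(x, bins, mean)               # group means of the predictor
s    <- tapply(y, bins, sum)                # events per group
n    <- tapply(y, bins, length)
odds <- log((s + 0.5) / (n - s + 0.5))      # empirical log odds per group
# plot(xm, odds)   # should look roughly linear here
cor(xm, odds)
```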

AGE

# description between the log(odds) and the quantitative predictors (should be linear)
logOddsVsQuantitativePredictor(df = df, binaryResponse = "CARDIO", method = "cluster", breaks = 10, quantitativePredictor = "AGE")

HEIGHT

logOddsVsQuantitativePredictor(df = df, binaryResponse = "CARDIO", method = "cluster", breaks = 10, quantitativePredictor = "HEIGHT")

# Problem: the relationship does not look linear

WEIGHT

logOddsVsQuantitativePredictor(df = df, binaryResponse = "CARDIO", method = "cluster", breaks = 10, quantitativePredictor = "WEIGHT")

# Problem: the relationship does not look linear

BMI

logOddsVsQuantitativePredictor(df = df, binaryResponse = "CARDIO", method = "cluster", breaks = 10, quantitativePredictor = "BMI")

AP_HI

logOddsVsQuantitativePredictor(df = df, binaryResponse = "CARDIO", method = "cluster", breaks = 10, quantitativePredictor = "AP_HI")

# Acceptable: the relationship is roughly linear

AP_LO

logOddsVsQuantitativePredictor(df = df, binaryResponse = "CARDIO", method = "cluster", breaks = 10, quantitativePredictor = "AP_LO")

# HEIGHT is discretized
df <- df %>% mutate(HEIGHT_grouped = as.factor(discretize(HEIGHT, method = "frequency", breaks = 3)))
attr(df$HEIGHT_grouped, "discretized:breaks") <- NULL
attr(df$HEIGHT_grouped, "discretized:method") <- NULL
HEIGHT_groupedFreq <- binaryResponseVSCategoricalPredictor(df = df, binaryResponse = "CARDIO",
                                                           categoricalPredictor = "HEIGHT_grouped")
# Frequencies
kable(HEIGHT_groupedFreq$frequencies)
     [120,161)  [161,168)  [168,207]
0         9887       9808      11844
1        10084       9193      11193
# Proportions
kable(HEIGHT_groupedFreq$proportions)
      [120,161)  [161,168)  [168,207]     Sum
0         15.94      15.82      19.10   50.86
1         16.26      14.83      18.05   49.14
Sum       32.21      30.64      37.15  100.00
# Chi2 test of independence
print(HEIGHT_groupedFreq$independanceChi2)
## 
##  Pearson's Chi-squared test
## 
## data:  responseVsPredictorFrequencies
## X-squared = 21.823, df = 2, p-value = 1.825e-05
# Proportions conditioned by outcome
kable(HEIGHT_groupedFreq$ConditionnalProportions1)
     [120,161)  [161,168)  [168,207]  Sum
0        31.35      31.10      37.55  100
1        33.09      30.17      36.73  100
# Proportions conditioned by predictor
kable(HEIGHT_groupedFreq$ConditionnalProportions2)
      [120,161)  [161,168)  [168,207]
0         49.51      51.62      51.41
1         50.49      48.38      48.59
Sum      100.00     100.00     100.00
plotProportionsResponseConditionnedByPredictor(HEIGHT_groupedFreq$ConditionnalProportions2, "HEIGHT")

GENDERFreq <- binaryResponseVSCategoricalPredictor(df = df, binaryResponse = "CARDIO",
                                                        categoricalPredictor = "GENDER")
GENDERFreq
## $frequencies
##           categoPred
## binaryResp     1     2
##          0 20259 11280
##          1 19546 10924
## 
## $proportions
##           categoPred
## binaryResp      1      2    Sum
##        0    32.67  18.19  50.86
##        1    31.52  17.62  49.14
##        Sum  64.19  35.81 100.00
## 
## $independanceChi2
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  responseVsPredictorFrequencies
## X-squared = 0.046658, df = 1, p-value = 0.829
## 
## 
## $ConditionnalProportions1
##           categoPred
## binaryResp      1      2    Sum
##          0  64.23  35.77 100.00
##          1  64.15  35.85 100.00
## 
## $ConditionnalProportions2
##           categoPred
## binaryResp     1     2
##        0    50.9  50.8
##        1    49.1  49.2
##        Sum 100.0 100.0
plotProportionsResponseConditionnedByPredictor(GENDERFreq$ConditionnalProportions2, "GENDER")

CHOLESTEROLFreq <- binaryResponseVSCategoricalPredictor(df = df, binaryResponse = "CARDIO",
                                                        categoricalPredictor = "CHOLESTEROL")
CHOLESTEROLFreq
## $frequencies
##           categoPred
## binaryResp     1     2     3
##          0 26560  3306  1673
##          1 20385  4802  5283
## 
## $proportions
##           categoPred
## binaryResp      1      2      3    Sum
##        0    42.83   5.33   2.70  50.86
##        1    32.87   7.74   8.52  49.14
##        Sum  75.71  13.08  11.22 100.00
## 
## $independanceChi2
## 
##  Pearson's Chi-squared test
## 
## data:  responseVsPredictorFrequencies
## X-squared = 2944.2, df = 2, p-value < 2.2e-16
## 
## 
## $ConditionnalProportions1
##           categoPred
## binaryResp      1      2      3    Sum
##          0  84.21  10.48   5.30 100.00
##          1  66.90  15.76  17.34 100.00
## 
## $ConditionnalProportions2
##           categoPred
## binaryResp      1      2      3
##        0    56.58  40.77  24.05
##        1    43.42  59.23  75.95
##        Sum 100.00 100.00 100.00
plotProportionsResponseConditionnedByPredictor(CHOLESTEROLFreq$ConditionnalProportions2, "CHOLESTEROL")

GLUCFreq <- binaryResponseVSCategoricalPredictor(df = df, binaryResponse = "CARDIO",
                                                 categoricalPredictor = "GLUC")
GLUCFreq
## $frequencies
##           categoPred
## binaryResp     1     2     3
##          0 27900  1834  1805
##          1 25104  2510  2856
## 
## $proportions
##           categoPred
## binaryResp      1      2      3    Sum
##        0    44.99   2.96   2.91  50.86
##        1    40.48   4.05   4.61  49.14
##        Sum  85.48   7.01   7.52 100.00
## 
## $independanceChi2
## 
##  Pearson's Chi-squared test
## 
## data:  responseVsPredictorFrequencies
## X-squared = 471.39, df = 2, p-value < 2.2e-16
## 
## 
## $ConditionnalProportions1
##           categoPred
## binaryResp      1      2      3    Sum
##          0  88.46   5.82   5.72 100.00
##          1  82.39   8.24   9.37 100.00
## 
## $ConditionnalProportions2
##           categoPred
## binaryResp      1      2      3
##        0    52.64  42.22  38.73
##        1    47.36  57.78  61.27
##        Sum 100.00 100.00 100.00
plotProportionsResponseConditionnedByPredictor(GLUCFreq$ConditionnalProportions2, "GLUC")

SMOKEFreq <- binaryResponseVSCategoricalPredictor(df = df, binaryResponse = "CARDIO",
                                                        categoricalPredictor = "SMOKE")
SMOKEFreq
## $frequencies
##           categoPred
## binaryResp     0     1
##          0 28575  2964
##          1 27935  2535
## 
## $proportions
##           categoPred
## binaryResp      0      1    Sum
##        0    46.08   4.78  50.86
##        1    45.05   4.09  49.14
##        Sum  91.13   8.87 100.00
## 
## $independanceChi2
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  responseVsPredictorFrequencies
## X-squared = 22.161, df = 1, p-value = 2.507e-06
## 
## 
## $ConditionnalProportions1
##           categoPred
## binaryResp      0      1    Sum
##          0  90.60   9.40 100.00
##          1  91.68   8.32 100.00
## 
## $ConditionnalProportions2
##           categoPred
## binaryResp      0      1
##        0    50.57  53.90
##        1    49.43  46.10
##        Sum 100.00 100.00
plotProportionsResponseConditionnedByPredictor(SMOKEFreq$ConditionnalProportions2, "SMOKE")

ALCOFreq <- binaryResponseVSCategoricalPredictor(df = df, binaryResponse = "CARDIO",
                                                        categoricalPredictor = "ALCO")
ALCOFreq
## $frequencies
##           categoPred
## binaryResp     0     1
##          0 29786  1753
##          1 28942  1528
## 
## $proportions
##           categoPred
## binaryResp      0      1    Sum
##        0    48.03   2.83  50.86
##        1    46.67   2.46  49.14
##        Sum  94.71   5.29 100.00
## 
## $independanceChi2
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  responseVsPredictorFrequencies
## X-squared = 9.0248, df = 1, p-value = 0.002663
## 
## 
## $ConditionnalProportions1
##           categoPred
## binaryResp      0      1    Sum
##          0  94.44   5.56 100.00
##          1  94.99   5.01 100.00
## 
## $ConditionnalProportions2
##           categoPred
## binaryResp      0      1
##        0    50.72  53.43
##        1    49.28  46.57
##        Sum 100.00 100.00
plotProportionsResponseConditionnedByPredictor(ALCOFreq$ConditionnalProportions2, "ALCO")

ACTIVEFreq <- binaryResponseVSCategoricalPredictor(df = df, binaryResponse = "CARDIO",
                                                        categoricalPredictor = "ACTIVE")
ACTIVEFreq
## $frequencies
##           categoPred
## binaryResp     0     1
##          0  5727 25812
##          1  6405 24065
## 
## $proportions
##           categoPred
## binaryResp      0      1    Sum
##        0     9.24  41.63  50.86
##        1    10.33  38.81  49.14
##        Sum  19.56  80.44 100.00
## 
## $independanceChi2
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  responseVsPredictorFrequencies
## X-squared = 80.494, df = 1, p-value < 2.2e-16
## 
## 
## $ConditionnalProportions1
##           categoPred
## binaryResp      0      1    Sum
##          0  18.16  81.84 100.00
##          1  21.02  78.98 100.00
## 
## $ConditionnalProportions2
##           categoPred
## binaryResp      0      1
##        0    47.21  51.75
##        1    52.79  48.25
##        Sum 100.00 100.00
plotProportionsResponseConditionnedByPredictor(ACTIVEFreq$ConditionnalProportions2, "ACTIVE")

# Splitting the data into a training and a testing set

# Rename the levels of the factors so that the train function works
make_factor <- function(x){return(factor(x, labels = make.names(levels(x))))}

# train data set
df <- df %>% mutate(CARDIO      = make_factor(CARDIO))
df <- df %>% mutate(GENDER      = make_factor(GENDER))
df <- df %>% mutate(CHOLESTEROL = make_factor(CHOLESTEROL))
df <- df %>% mutate(GLUC        = make_factor(GLUC))
df <- df %>% mutate(SMOKE       = make_factor(SMOKE))
df <- df %>% mutate(ALCO        = make_factor(ALCO))
df <- df %>% mutate(ACTIVE      = make_factor(ACTIVE))
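A standalone check of the renaming (same make_factor definition as above): make.names() turns numeric levels such as "0"/"1" into syntactically valid names, which caret::train() requires when class probabilities are computed:

```r
make_factor <- function(x) factor(x, labels = make.names(levels(x)))
f <- factor(c(0, 1, 1, 0))
levels(f)               # "0"  "1"
levels(make_factor(f))  # "X0" "X1"
```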


# make this example reproducible
set.seed(1)


# use 70% of the dataset as the training set and the remaining 30% as the testing set
sample_train   <- sample(c(TRUE, FALSE), nrow(df), replace = TRUE, prob = c(0.7, 0.3))
df_train       <- df[sample_train, ]
df_test        <- df[!sample_train, ]

Fitting the null model

# Null model
m0 <- glm(CARDIO ~ 1, family=binomial(link = logit),
          data = df_train)
AIC(m0)
## [1] 60033.83
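For a binomial GLM on 0/1 data the deviance equals -2 x log-likelihood, so AIC = deviance + 2k, where k is the number of estimated parameters. The null model estimates only the intercept (k = 1), which lets us recover the unrounded null deviance from the AIC value above:

```r
# AIC(m0) = null deviance + 2 * k, with k = 1 (intercept only)
aic_m0 <- 60033.83
null_deviance <- aic_m0 - 2   # 60031.83, shown rounded as 60032 in the deviance tables
null_deviance
```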

Fitting the initial model

# Initial model
m1 <- glm(CARDIO ~ AGE + GENDER + HEIGHT_grouped + WEIGHT + BMI + AP_HI + AP_LO + CHOLESTEROL + GLUC + SMOKE + ALCO + ACTIVE, family = binomial(link = logit), data = df_train)

Variance inflation factor (VIF)

VIFTable <- plotVIF(m1)

kable(VIFTable)
rowName             GVIF  Df  GVIF^(1/(2*Df))
AGE             1.022810   1         1.011340
GENDER          1.489076   1         1.220277
HEIGHT_grouped  4.172220   2         1.429196
WEIGHT         13.435491   1         3.665446
BMI            12.432973   1         3.526042
AP_HI           1.759194   1         1.326346
AP_LO           1.742250   1         1.319943
CHOLESTEROL     1.512578   2         1.108995
GLUC            1.496330   2         1.106004
SMOKE           1.248375   1         1.117307
ALCO            1.144394   1         1.069764
ACTIVE          1.002561   1         1.001280

Some collinearity is detected, with VIF values above 10 for both BMI and WEIGHT.
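This inflation is expected, since WEIGHT, HEIGHT and BMI are functionally related. As an illustration of how a VIF arises (VIF_j = 1 / (1 - R²_j), where R²_j comes from regressing predictor j on the other predictors), here on synthetic data with plausible but invented means and standard deviations:

```r
set.seed(1)
h <- rnorm(1000, 164, 8)     # height in cm (synthetic)
w <- rnorm(1000, 74, 14)     # weight in kg (synthetic)
bmi <- w / (h / 100)^2
# VIF of weight given BMI and height:
r2 <- summary(lm(w ~ bmi + h))$r.squared
vif_w <- 1 / (1 - r2)
vif_w                        # far above the usual cutoff of 10
```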

Deviance table

anova(m1, test = "Chisq")
## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: CARDIO
## 
## Terms added sequentially (first to last)
## 
## 
##                Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                           43313      60032              
## AGE             1   2357.1     43312      57675 < 2.2e-16 ***
## GENDER          1      0.1     43311      57675    0.8006    
## HEIGHT_grouped  2      1.6     43309      57673    0.4443    
## WEIGHT          1   1115.0     43308      56558 < 2.2e-16 ***
## BMI             1     39.7     43307      56518 3.025e-10 ***
## AP_HI           1   6911.4     43306      49607 < 2.2e-16 ***
## AP_LO           1     40.3     43305      49567 2.226e-10 ***
## CHOLESTEROL     2    674.8     43303      48892 < 2.2e-16 ***
## GLUC            2     56.5     43301      48835 5.261e-13 ***
## SMOKE           1     30.9     43300      48804 2.721e-08 ***
## ALCO            1     34.2     43299      48770 4.992e-09 ***
## ACTIVE          1     72.5     43298      48698 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

AIC Value

AIC(m1)
## [1] 48729.71

R2 values

print(computeR2s(m1, m0))
##   mcFadden_R2 adjMcFadden_R2    sas_R2 adjSas_R2    dev_R2
## 1   0.1888018      0.1882958 0.2302376 0.3070168 0.1888018
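McFadden's R² can be checked directly against the deviance table of m1 above, since R²_McF = 1 - D_model / D_null:

```r
null_dev  <- 60032   # residual deviance of NULL in the deviance table
model_dev <- 48698   # residual deviance after the last term (ACTIVE)
1 - model_dev / null_dev   # ~ 0.1888, matching mcFadden_R2 above
```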

Area under the ROC curve

Cstat(m1)
## [1] 0.7860068

Hosmer and Lemeshow goodness of fit test

HoslemTest(m1)
## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  model$y, fitted(model)
## X-squared = 318.9, df = 8, p-value < 2.2e-16
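The Hosmer and Lemeshow statistic bins observations by fitted probability and compares observed with expected event counts per bin. The HoslemTest() wrapper comes from the sourced add-ons file; a hand-rolled base-R sketch of the statistic on synthetic data:

```r
hl_stat <- function(y, p, g = 10) {
  grp <- cut(p, quantile(p, seq(0, 1, length.out = g + 1)),
             include.lowest = TRUE)       # g bins of fitted probability
  o <- tapply(y, grp, sum)                # observed events per bin
  e <- tapply(p, grp, sum)                # expected events per bin
  n <- tapply(p, grp, length)
  sum((o - e)^2 / (e * (1 - e / n)))      # ~ chi-squared on g - 2 df
}
set.seed(3)
x   <- rnorm(800)
fit <- glm(rbinom(800, 1, plogis(x)) ~ x, family = binomial)
hl_stat(fit$y, fitted(fit))
```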

Residual plots

residualPlots(m1, binaryresponse = "CARDIO")

Backward elimination

Backward elimination is often regarded as the least bad method for variable selection, so let us try it. We discard WEIGHT, not because of its p-value but because of its collinearity with BMI; BMI seems a better compromise between WEIGHT and HEIGHT.

m <- m1
m <- update(m, . ~ . - WEIGHT)

Variance inflation factor (VIF)

plotVIF(m)

##           rowName     GVIF Df GVIF^(1/(2*Df))
## 1             AGE 1.019688  1        1.009796
## 2          GENDER 1.426185  1        1.194230
## 3  HEIGHT_grouped 1.348171  2        1.077547
## 4             BMI 1.082027  1        1.040205
## 5           AP_HI 1.759202  1        1.326349
## 6           AP_LO 1.741904  1        1.319812
## 7     CHOLESTEROL 1.512485  2        1.108978
## 8            GLUC 1.496195  2        1.105980
## 9           SMOKE 1.247997  1        1.117138
## 10           ALCO 1.144285  1        1.069713
## 11         ACTIVE 1.002320  1        1.001159

All VIF values are now below 2.4, which is acceptable.

Deviance table

anova(m, test = "Chisq")
## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: CARDIO
## 
## Terms added sequentially (first to last)
## 
## 
##                Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                           43313      60032              
## AGE             1   2357.1     43312      57675 < 2.2e-16 ***
## GENDER          1      0.1     43311      57675    0.8006    
## HEIGHT_grouped  2      1.6     43309      57673    0.4443    
## BMI             1   1142.6     43308      56530 < 2.2e-16 ***
## AP_HI           1   6920.1     43307      49610 < 2.2e-16 ***
## AP_LO           1     40.6     43306      49570 1.872e-10 ***
## CHOLESTEROL     2    675.7     43304      48894 < 2.2e-16 ***
## GLUC            2     56.5     43302      48837 5.366e-13 ***
## SMOKE           1     30.5     43301      48807 3.260e-08 ***
## ALCO            1     34.0     43300      48773 5.513e-09 ***
## ACTIVE          1     73.0     43299      48700 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We could discard GENDER, considering it is not a significant factor for these data, but it would be useful to review some literature before discarding it definitively.

AIC Value

AIC(m)
## [1] 48729.92

R2 values

print(computeR2s(m, m0))
##   mcFadden_R2 adjMcFadden_R2    sas_R2 adjSas_R2    dev_R2
## 1   0.1887651      0.1882924 0.2301984 0.3069645 0.1887651

Area under the ROC curve

Cstat(m)
## [1] 0.7859752

Hosmer and Lemeshow goodness of fit test

HoslemTest(m)
## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  model$y, fitted(model)
## X-squared = 321.35, df = 8, p-value < 2.2e-16

Residual plots

residualPlots(m, binaryresponse = "CARDIO")

Interpreting the odds ratios

The estimated coefficients and their confidence intervals are exponentiated to obtain the odds ratios.

Odds ratios values with confidence intervals

oddsRatios   <- exp(coef(m))
oddsRatios   <- data.frame(oddsRatios = oddsRatios) %>% rownames_to_column(var = "rowname")
oddsRatiosCI <- as.data.frame(exp(confint(m))) %>% rownames_to_column(var = "rowname")
## Waiting for profiling to be done...
oddsRatios   <- full_join(oddsRatios, oddsRatiosCI)
## Joining with `by = join_by(rowname)`
names(oddsRatios) <- c("rowname", "oddsRatios", "Lower2.5pct", "Upper97.5pct") 
kable(oddsRatios)
rowname                    oddsRatios  Lower2.5pct  Upper97.5pct
(Intercept)                 0.0000028    0.0000020     0.0000039
AGE                         1.0516232    1.0481157     1.0551504
GENDERX2                    0.9436157    0.8942425     0.9956903
HEIGHT_grouped[161,168)     1.0707462    1.0127119     1.1321311
HEIGHT_grouped[168,207]     1.1353205    1.0692226     1.2055453
BMI                         1.0325169    1.0271331     1.0379333
AP_HI                       1.9255924    1.8775234     1.9752699
AP_LO                       1.1317975    1.0852240     1.1803339
CHOLESTEROLX2               1.4561624    1.3605421     1.5586241
CHOLESTEROLX3               3.0670461    2.8058635     3.3552141
GLUCX2                      1.0112132    0.9237890     1.1070037
GLUCX3                      0.6807467    0.6171558     0.7506783
SMOKEX1                     0.8653778    0.7940137     0.9430346
ALCOX1                      0.7385722    0.6644722     0.8207208
ACTIVEX1                    0.7902923    0.7487356     0.8341414

Odds ratios values with confidence intervals (graphic)

plotOddsRatios(oddsRatiosDf = oddsRatios)

Looking at the odds ratios for ACTIVE, level 1 has odds about 20.97% lower than level 0, which is counter-intuitive. Level 1 for ALCO and SMOKE also shows lower odds than level 0 (about 26.14% lower for ALCO and 13.46% lower for SMOKE). Is this a problem related to the coding of the levels? AP_HI has a strong influence: the odds of the disease increase by 92.56% for each additional point of AP_HI. The same holds for AP_LO, but less markedly (about 13.18% per unit of AP_LO). BMI increases the odds by 3.25% per unit of BMI. CHOLESTEROL level 2 has odds 45.62% higher than level 1 and, worse, CHOLESTEROL level 3 increases the odds by 206.70% compared with level 1. GENDER seems to have little influence in these data, with level 2 having odds 5.64% lower than level 1. GLUC level 3 has odds 31.93% lower than GLUC level 1, while HEIGHT increases the odds by 7.07% for level [161,168) compared with the reference level (under 161 cm), and level [168,207] shows odds 13.53% greater than the reference level.
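The percentages quoted above come directly from the odds ratios: for a coefficient β the odds ratio is exp(β), and the percent change in the odds per unit (or versus the reference level) is (exp(β) - 1) x 100. For example:

```r
or_ap_hi  <- 1.9255924        # odds ratio for AP_HI, from the table
(or_ap_hi - 1) * 100          # +92.56% odds per unit of AP_HI
or_active <- 0.7902923        # odds ratio for ACTIVEX1
(or_active - 1) * 100         # -20.97%: lower odds for level 1
```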

Roc curves for the training and the testing sets

model <- m
resultsROC <- rocCurve(model, df_train = df_train, df_test = df_test, outcomeVariable = "CARDIO")
## Training data set
## -----------------
## Dimensions of the training data set : 43314 15
## length of the outcome variable : 43314
## length of the predicted variable : 43314
## Train data set
## Optimization method : Youden metric
## Testing data set
## -----------------
## Dimensions of the testing data set : 18695 15
## length of the outcome variable : 18695
## length of the predicted variable : 18695
## Test data set
## Optimization method : Youden metric
## Setting levels: control = X0, case = X1
## Setting direction: controls < cases
## Setting levels: control = X0, case = X1
## Setting direction: controls < cases
## Setting levels: control = X0, case = X1
## Setting direction: controls < cases
## Setting levels: control = X0, case = X1
## Setting direction: controls < cases

kable(resultsROC$optim_train)
          FPR        TPR       MCER  threshold  YoudenMetric  TLCMetric
46  0.2868899  0.7347605  0.2758692       0.45     0.4478707  0.3907144
kable(resultsROC$optim_test)
          FPR        TPR      MCER  threshold  YoudenMetric  TLCMetric
48  0.3083442  0.7543989  0.276491       0.47     0.4460547   0.394203
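The YoudenMetric column is Youden's J statistic, J = TPR - FPR, maximised over the candidate thresholds; it can be verified from the training-set row above:

```r
tpr <- 0.7347605     # training-set TPR at threshold 0.45
fpr <- 0.2868899     # training-set FPR at threshold 0.45
tpr - fpr            # ~ 0.44787, the YoudenMetric value
```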